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Abstract — The Arimoto-Blahut algorithm for computing the 
capacity of a discrete memoryless channel is revisited. A so-called 
"squeezing" strategy is used to design algorithms that preserve 
its simplicity and monotonic convergence properties, but have 
provably better rates of convergence. 

Index Terms — alternating minimization; channel capacity; dis- 
crete memoryless channel; rate of convergence. 



I. Introduction 

The Arimoto-Blahut Algorithm JT], (ABA) plays a fun- 
damental role in numerical calculations of channel capacities. 
This iterative scheme has an appealing geometric interpreta- 
tion ([5 1), and possesses a desirable monotonic convergence 
property. We refer to H3, (BJ, (BJ, 0. ED for extensions 
and improvements. 

We study variants of ABA with an aim to speed up the 
convergence while maintaining the simplicity. The focus is on 
the discrete memoryless channel, and on theoretical properties; 
extensions and further numerical results will be reported in 
future works. Our investigation relies on certain reformula- 
tions that slightly generalize the original capacity calculation 
problem. Each formulation leads to an Arimoto-Blahut-type 
algorithm, which is monotonically convergent, and typically 
as easily implemented as the original ABA. A formula for the 
rate of convergence provides valuable insight as to when ABA 
is slow. Comparison theorems show that our constructions 
are at least as fast as the usual ABA as measured by the 
global convergence rate. Numerical examples show that the 
improvement can be substantial. 

Our approach differs from other acceleration methods for 
ABA (e.g., the proximal point formulation of [10|) in that 
we focus on preprocessing or "reparameterizing" the prob- 
lem (Sections III and IV). Such reparameterizations, broadly 
termed "squeezing," aim at reducing the overlap between rows 
of the channel matrix. Our technical contributions include 
the monotonic convergence theorem of Section IV, and the 
convergence rate comparison theorems of Section V. These 
theoretical results are illustrated with simple examples. 

II. Variants of Arimoto-Blahut 

A discrete memoryless channel is associated with an m x n 
transition matrix W = (Wij), i.e., Wij specifies the prob- 
ability of receiving the output letter j if the input is i. 
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Mathematically W tJ > and £\ W l0 = 1 for all i. The 
information capacity is defined as 



sup I(p), Hp) = y>D(Wi||pW). 
pen *-? 



(1) 



See 0, for interpretations of this fundamental quantity. 
Throughout il denotes the probability simplex 

Sl = {p=(pi,... ,p m ) : Pt>0, pl m = 1}, 

l m denotes the m X 1 vector of ones, Wi denotes the ith 
row of W, i.e., W t = (Wn, . . . , W in ), and D(q\\r) = 
J2i ( li^°s(li/ r i) f° r nonnegative vectors q = (%) and r = 
(fj). We use natural logarithm (except for Fig. 3) and obey 
the convention 01og(0/a) = 0, a > 0. Let us also define 
H(q) = — Y^; Qi logQi for a nonnegative vector q = (qi). It is 
not required that £\ qi = 1. Without loss of generality assume 
that not all rows of W are equal, and that none of its columns 
is identically zero. 

An example of our general class of algorithms for solving 
<£XJ is as follows. Let A 6 R satisfy 

(2) 



1 < A < 



1 -X), mhii W i3 



Algorithm I: Singly Squeezed ABA. Choose e Q such 
that pf^ > for all i. For t = 0, 1, . . ., calculate p( t+1 ' as 
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(3) 

Iterate until convergence. 

One recognizes Algorithm I as a generalization of the orig- 
inal Arimoto-Blahut Algorithm, which corresponds to A = 1. 
This simple generalization has been considered before (see, 
e.g., [10]). What is new is the constraint ©. Under this 
constraint, Algorithm I is guaranteed to converge monoton- 
ically (Section IV), and its convergence rate is no worse than 
that of ABA (Section V). The nickname reflects our intuitive 
interpretation of Algorithm I and is explained near the end of 
Section III. 

Example 1. Consider the channel matrix 



W = 



0.7 0.2 0.1 
0.1 0.2 0.7 



which is also used by [|T0l as an illustration. Let us choose 
A = 5/3, which attains the upper bound in (0. Fig. 1 compares 
the iterations p\ , t = 1, 2, ... , produced by ABA and by 
Algorithm I with A = 5/3. Each algorithm is started at 
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Fig. 1. Iterations of pj ' for ABA ("-0-") and Algorithm I ("-X-") with 
A = 5/3. 



= (1/3, 2/3). Algorithm I, however, appears to approach 
the target p* = (1/2, 1/2) faster than ABA. Different starting 
values give similar comparisons. 

Algorithm I is a special case of the following class of 
algorithms. Henceforth define 

il(r) = {pefl: p>r} 

for any 1 x m vector r > 0. For vectors (matrices) A and B 
of the same dimension, A > B means every entry of A — B 
is nonnegative. The to x to identity matrix is written as I m . 
Let r be a nonnegative 1 x to vector such that 



W > l m rW. 
Define r + — rl m . Let A (a scalar) satisfy 



1 



l-r 4 



< A < 



1 



1 - V miniW-^ 
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Algorithm II: Doubly Squeezed ABA. Choose p^ £ f2(r 
such that > for all i. For t = 0, 1, . . ., calculate 



p\ — max 



{r<, **>p<*>e«p (**<*>)} 



(6) 



where 



and 5^' is chosen such that Y^iP 
output 



(<+!) 1 TT 

= 1. Upon convergence, 



P = 



A stopping criterion for practical implementation is (e > 0) 

(7) 



This is the same criterion as often used for ABA (0), and it 
is convenient since the quantities zf' are readily available at 
each iteration. 

A key requirement is (0). It implies, for example, 



< y^min Wij < 1, 



assuming that not all rows of W are equal. When to = 2, © 
becomes 



< 



IF 
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2.7 



1 - n - r 2 ji w^>Wvj Wij - W 2j 



and 



< 
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(8) 



(9) 



1 - n - r 2 j: w 2j >Wu TF 2 j - Wij 

For general to, the restrictions on r are less clear. See Section 
V for further discussion. 

If r = 0, then © reduces (0), showing Algorithm II as 
a generalization of Algorithm I. Compared with Algorithm I, 
Algorithm II is only slightly more difficult to implement. In 
©, determining 5^ is a form of waterfilling (O), which can 
be implemented in O(TOlogm) time. (A simple implemen- 
tation is included in Appendix A for completeness.) Hence 
the additional cost per iteration is minor. The improvement in 
convergence rate, however, can be substantial. 

Example 1 (continued). Consider Algorithm II with A = 
5/3 and r = (1/8, 1/8). Then ® and © are satisfied with 
equalities. Inspection of © reveals that we have p^ = 
(1/2, 1/2), regardless of the starting value p(°\ (It is easier 
to verify this with the equivalent form of Algorithm II in 
Section III.) That is, with this choice of A and r, Algorithm 
II converges in one step. 

The general validity of Algorithm II is verified in Section 
IV. The critical issue of which values of r and A lead to 
fast convergence is studied in Section V, where theoretical 
justifications are provided for the following guideline. For fast 
convergence, we should 

• set A at the upper bound in ©, and 

• let r/(l — r + ) be as large as possible, subject to the 
restriction (0J. 

For to = 2, this means that r should satisfy the equalities 
in (HJl and ©. Although Example 1 already hints at such a 
recommendation, we also conduct a simulation for illustration. 
Example 2. A channel matrix W with to = 2 and n = 8 



is generated according to 



i/J2k u ik where m 



are independent uniform(0, 1) variates. The original ABA, 
Algorithm I, and Algorithm II are compared. For Algorithm 
I, we set A at the upper bound in (O; for Algorithm II, 
we choose r/(l — r + ) to satisfy the upper bounds in ©- 
(0, and set A at the upper bound in (|5). The starting val- 
ues are p^ — (1/2,1/2) for ABA and Algorithm I, and 
= (1 - r+)(l/2, 1/2) + r for Algorithm II. We record 
the number of iterations until the common criterion (|7]i is met 
with e = 10~ 8 . The experiment is replicated 100 times. 

The improvement in speed by using Algorithm I or Algo- 
rithm II is evident from Fig. 2, which displays two bivariate 
plots of the numbers of iterations. While ABA sometimes takes 
hundreds of iterations, Algorithm I takes no more than 40, and 
Algorithm II no more than 16, throughout the 100 replications. 
The large reduction in the number of iterations is also shown 
in Fig. 3, which summarizes the log 2 acceleration ratios, 
defined as \og 2 (NABA/Ni) for Algorithm I, for example. 
Here Naba (resp. N]) denotes the number of iterations for 
ABA (resp. Algorithm I). The median acceleration ratio is 
4.0 for Algorithm I, and around 7.1 (2 2 83 ) for Algorithm II. 
The minimum acceleration ratio is 2.2 for Algorithm I and 
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Fig. 2. Comparing the numbers of iterations for three algorithms in Example 
2. 
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Fig. 3. Acceleration ratios in Example 2. 



Algorithm II 



2.8 for Algorithm II. Overall this supports the preference for 
large values of A and r/(l — r + ), subject to and (0, in 
implementing Algorithm II. 

Remark. One may still implement Algorithm II with some 
r, A that do not satisfy (|4]l or (0. For example, it is conceivable 
that values of A slightly exceeding the upper bound in (0 
could lead to even faster convergence. However, our theoretical 
results only guarantee convergence under (0 and (0. It is also 
intuitive that setting A too large would overshoot and no longer 
maintain mono tonic convergence. 

III. Equivalent form of Algorithm II 

Although Algorithm II is convenient for practical imple- 
mentation, we write it in an equivalent form (Algorithm III) 
to study the theoretical properties. 

Let r (1 X to) and / (1 xn)be nonnegative vectors that 
satisfy 

W = (1 + /+ /"' ~ lmT W - W > 0, r + =rl m <\, 



1 -r 4 



and /+ = fl n . Set 



(10) 



H(Wi), l<i<m. (11) 



Algorithm III: Doubly Squeezed ABA. Choose p^ e 
f2(r) such that p^ -* > for all i. For t = 0, 1, . . ., calculate 



4> 



(t) _ pfWij 



(12) 



(t+i) 
p\ = max 



[r u a (*)e^+^^ lo s^ ) }, (13) 



where is chosen such that YliPi = 1- Upon conver- 
gence, output 



_ r 



l-r+ 

The restriction ( [Tot can be broken down as 



r+ < 1, W* 



1-n 



1^ > 0, 



and 



(i + f+)w* - i m f > o. 



(14) 



(15) 



The restriction (TT~4b is a restatement of dU, while ( fT~5b is 
equivalent to 



fj < (l + f+)minW*, i = l,...,n. 



If we set 



A 



1 + 

1-T4 



(16) 



(17) 



then Algorithm III reduces to Algorithm II. Indeed, by sum- 
ming over j, (fT~6b leads to 



1 + 



from which we obtain the upper bound in (01. Moreover, after 



some algebra, the mapping p 



(0 



,(*+!) 



as specified by (fT2l - 



( fT3l reduces to (O. (A useful identity in this calculation is 
pW + f — X(p — r)W; see also Proposition Q] in Section IV.) 
Thus Algorithm III reduces to Algorithm II with A given by 

Conversely, suppose r and A satisfy and (0. If we define 



/ J - = [A(l-r + )-l], 



mini W* 



E fe mi n l W* k ' 

with W* given by ( fl4b . then ( TPTb is satisfied. We also deduce 
/j > and ( fTol l from (0. Thus Algorithm II is equivalent to 
Algorithm III with this choice of /. 

We shall show that Algorithm Willi converges monotoni- 
cally, and its convergence rate is no worse than that of ABA. 
Intuitively, ABA is slow when there exists a heavy overlap 
between rows of the channel matrix W. Algorithm III, which 
works with W rather than W, can be seen as trying to reduce 
this overlap. Its nickname is derived from the transformation 
([Tol l, which subtracts, or "squeezes out," a nonnegative vector 
from each row of W. If r = 0, then only a vector proportional 
to / is subtracted. But Algorithm III with r = is equivalent to 
Algorithm II with r = 0, which is simply Algorithm I. Hence 
Algorithm I is called "Singly Squeezed ABA". For general r 
and /, we squeeze out both a vector proportional to / and 
another one proportional to rW. Hence Algorithm II/III is 
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called "Doubly Squeezed ABA". The vector r also modifies 
the space f2 we work on, thus making rW separate from /. 

Example 1 (continued). Consider Algorithm III with 
r = (1/8, 1/8) and / = (0, 1/4, 0). This corresponds to 
Algorithm II with the same r and A = 5/3. By ( fTOb we have 



W 



1 
1 



The rows of W no longer overlap, i.e., WijW^j = for all j. 
Inspection of ([TZt and (\3[ reveals that we have $W = W T 
and pW — (1/2, 1/2), regardless of the starting value p(°>. 
Thus, as mentioned earlier, Algorithm II/III converges in one 
step. 

IV. Validity of Algorithm II/III 

Given an m x n stochastic matrix V, a 1 x n vector / > 0, 
a 1 x m vector c, and p £ let us define 

I(p\V,f,c) = Y^Pi( D ( V iWf+pV) + (H) + D(f\\f+ P V). 

■i 

Equivalently, 

I(p\V, f, c) = H( P V + f) + Y,M^-H(V l ))-H(f). (18) 

i 

We have I(p\ W, 0, 0) = I(p) as in ([T). However, there exist 
less obvious relations. Proposition [TJ is key to our derivation 
of Algorithm III. 

Proposition 1: Let r, /, W, and c = [c\, . . . , c m ) satisfy 
([Toll and CETJ. Then 



i(p\w,o,o) 



I(p\WJ,c) + H(f) 
+ log(l + /+) + 



(19) 



E l nH(W l ) 



l-r 4 



where p = (1 — r + )p + r. 
Proof: Noting 

pW + / = (1 + 

the claim follows from ( fT8l and routine calculations. ■ 
Relation ( fT9l impl ies that, in order to maximize 7(p W : 0, 0) 
over p £ J7, we may equivalently maximize I(p\W, /, c) over 
p £ f2(r), and then set p = (p — r)/(l — r + ). Let us consider 
solving this slightly more general problem. 

Problem I. Let W be an m x n stochastic matrix, let / > 
be a 1 x n vector, and let r, c be 1 x m vectors. Assume r > 
and r + = rl m < 1. Maximize /(plW 7 , /, c) over p G ^(0- 

Problem I can be handled by a straightforward exten- 
sion of ABA. Following (UJ, (2), we note that maximizing 
I(p\W, f, c) is equivalent to maximizing 



I(p, $) = PiWii log -2- + fj log $j0 

»>1,3 Pi J «>1 



overp G £!(r) an d ^ ( an nx (m+l) stochastic matrix) jointly. 
This holds because, for fixed p, I(p, <£>) is maximized by 



PiW r , 



31 



J'O 



(20) 



and the maximum value is I(p\W,f,c). On the other hand, 
for fixed $, I(p, $) is maximized by 



Pi 



ae 



(21) 



where a is chosen such that J^iPi = !• This verifies the 
Karush-Kuhn-Tucker conditions. If r = 0, then (fJTJ) reduces 
to 

exp(c, + J2j Wij log $ji) 



Pi 



(22) 



Algorithm III simply alternates between (f20b and (fJJJ). 
At each iteration, the function 7(p|V^, /, c) never decreases. 
Theorem [TJ shows that Algorithm III converges to a global 
maximum. The proof uses the alternating minimization inter- 
pretation of J5); see Appendix B. 

Theorem 1 (monotonic convergence): Let pW be a se- 
quence generated by Algorithm III. Then lim^oo p 



(t) = p (oo) 



exists and, as t /* oo, 



J(pW|^,/,c)/ sup /(plW./.c). 

By Proposition [TJ (p^ 00 ' — r)/(l — r+) is a global maximizer 
of 7(p|W, 0,0) over p G f2. That is, Algorithm III correctly 
solves the optimization problem ([TJi in the limit. 

V. Rate of convergence 

Throughout this section the notation of Algorithm III is 
assumed. For example, W is defined via (fTOb . We derive 
a general formula (Theorem [2]i for the rate of convergence. 
Comparison results (Theorems [3] and [U show that Algorithm 
III is at least as fast the original ABA. Based on the com- 
parison theorems, a general recommendation is to let r and / 
("the squeezing parameters") be as large as permitted for fast 
convergence. 

Assume the iteration dT^ — (fl~3l> converges to some p* in the 
interior of f2(r), i.e., p* > r.i for all i. Denote the mapping 



from p 



(*) 



,(*+!) by M. Then p* = M(p*), i.e., p* is a 



fixed point. We emphasize that, because p* is assumed to lie 
in the interior of f2(r), so are all jjw for large enough t. Hence 
( TOb eventually takes the form of d22l . i.e., 



P 



(*+i) 



expfc + Ej Wij logoff) 



E ; >iCx P (Q + E J ^iog^ ; ) 

We call R(p*) = dM(p*) / dp the (m x m) matrix rate of 
convergence of Algorithm EH, because 

Jt+i) 



for pW near p*. The spectral radius of i?(p*), written as 
S(R(p*)), is called the global rate of convergence. (The 
smaller the rate, the faster the convergence.) Such notions are 
not uncommon in analyzing fixed point algorithms (see, e.g., 
|[6) and ifTTI ). Technically, the global rate should be defined 
as the spectral radius of a restricted version of R(p*), because 
(p(*) _ p*)l m = 0. However, the spectral radius of R{p*) is 
the same without this restriction (see Appendix D). 

The matrix R{p*) admits a simple formula (Theorem O; 
see Appendix C for its proof. 
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Theorem 2 (rate of convergence): We have 

R(P*) = Ira - W% (23) 

where the n x m matrix = i^ji) is specified by 

* j4 = * 3 -i(p*) + p*$ j0 (f>*), 1 < i < n, 1 < i < m, (24) 

and is $ji as in d20t when taking p = p*. 

For the original ABA, we have R(p*) = I m - W$(p*), 
which can be broadly interpreted as a measure of how noisy 
the channel is. If m = n and W approaches I m , then so does 
$(p*), and R{p*) approaches zero. At the opposite end, if 
rows of W overlap almost entirely, then WQ(p*) is nearly 
singular, leading to a large S(R(p*)), and slow convergence 
for ABA. See Corollary Q] for a more quantitative statement. 

Example 1 (continued). The maximizer of I{p) is p = 
(1/2, 1/2). The matrix rates are calculated for ABA (Rq) and 
for Algorithm III with r = and / = (1/6, 1/3, 1/6) (Ri), 
which is equivalent to Algorithm I with A = 5/3: 



Ro = 0.275 



1 -1 
-1 1 



i?i = 0.125 



1 -1 
-1 1 



The global rates are S(R ) = 0.55 and 5(i?i) = 0.25. Thus 
we confirm the advantage of this choice of A for Algorithm I. 
For Algorithm III with r = (1/8, 1/8) and / = (0, 1/4, 0), 
the global rate is zero. 

Propositions [2] and [3] explore basic properties of R(p*); see 
Appendix D for the proofs. 

Proposition 2: We have 

1 R(p*)l m = 0; 

2 if / = 0, then R(p*) is diagonalizable. 
Proposition 3: If d is an eigenvalue of R(p*), then d is 

real and < d < 1. 

Propositions [2] and [3] are used in deriving our main compar- 
ison results for convergence rates. Let us write the global rate 
for Algorithm III as R(r, f) to highlight its dependence on 
the vectors r and /. The global rate for ABA is R(0, 0). The 
different algorithms under comparison are assumed to deliver 
the same final output p. 

Theorem [3] presents an exact relation between the global 
rates for the same r but different /; see Appendix D for its 
proof. 

Theorem 3: We have 



R(r,f) = (l + f+)R(r,0)-f. 



+ ■ 



(25) 



Consequently, R(r, f) < R(rJ) if /+ > /+. 

Remark. By writing R(r, /), we already assume that (TTOb 
is satisfied with / in place of /. See also Theorem [4] below. 

For fixed r, Theorem [3] simply recommends large values of 
/+ for fast convergence. In view of the constraint ( fT6] l, this 
implies that, given r, R(r, f) is minimized by 



fj = 



mini W*j 



l<j<n, 



(26) 



where W* is defined in ( TBI . This also leads to a nontrivial 
bound on R(0,0). 

Corollary 1: R(Q, 0) > J2j min ; w ij- 



Proof: We have R(r,0) > /+/(1 + /+) from ([25), since 
R( r , f) > 0. The claim follows by choosing r = and / as 
in ■ 

Corollary Q] formalizes the intuition that ABA is likely to be 
slow when there exists heavy overlap between rows of W. The 
quantity • mini Wij is, in a sense, a conservative measure 
of this overlap. 

To compare the global rates for different values of r, it is 
convenient to write 

9 = ^^rW + /. (27) 
l-r+ 

Then / can be recovered from g via 

f = g-(l + g+)rW, g+=gl n . (28) 

Let us define 

R(r,g) = R(r,f) 

in view of this correspondence. 

Corollary 2: For fixed r, R(r,g) decreases in g + . 
Proof: Noting 

./+ = (l + . 9+ )(l-r + )-l, (29) 

the claim follows from Theorem [3] ■ 
An advantage of using g is that its optimal choice does not 
depend on r. 

Proposition 4: For fixed r that satisfies ©, R(r, g) is 
minimized by 



min, ; Wi-. 



9j 



1 < j < n. 



(30) 



1 - J2k mm * W ik ' 

Proof: By direct calculation, (f30b follows from (|27T > and 



Theorem |4] compares the global rates as a function of r 
when g is fixed. The proof is presented in Appendix D. 

Theorem 4: For fixed g, R(r,g) decreases in r/(l — r + ), 
i.e., 



> 



ii(r, fl ) <R(f,g). 



Theorem |4] is relatively strong. It implies Corollary |3] as 
can be verified from Theorem [3] and (|27] i. 

Corollary 3: For fixed /, i?(r, /) decreases in r/(l — r + ). 
Consequently i?(r, /) decreases in r. 

Overall the function R(r,g) decreases in both r/(l — r + ) 
and g. Since the original ABA corresponds to (r, <?) = (0, 0), 
Algorithm III is never worse than the original ABA in terms 
of the global rate. 

Corollary 4: We have 

R(rJ)=R(r,g)<R(0,0) = R(0,0). 

Theorem [4] and Proposition [4] lead to a general rule for 
choosing the "squeezing parameters". One should choose the 
largest allowable g as specified by ( f3Qb . and then choose a 
large r/(l — r + ) subject to (0). For m = 2 this resolves the 
optimal choice of (r, g) completely. 

Corollary 5: If m = 2, then R(r, g) is minimized when g 
satisfies (f30b and r satisfies the equalities in dH) and ((5). 
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For general m > 2, finding the optimal r appears nontrivial. 
Fortunately, the optimal r is not strictly necessary for achiev- 
ing substantial improvements. In Examples 1 and 2, Algorithm 
I, i.e., r = 0, is already considerably faster than ABA. If the 
optimal r is difficult to find, an option is to fix some q € fi, 
and set r — 8q, 8 > 0. The constraint reduces to 

o < mm V- . 

" ij {qW) 3 

Then we can set 8 at this upper bound. We leave the choice 
of q as an open problem for further investigations. 

Remark. Results in this section carry over to Algorithm II 
since Algorithm III is equivalent to Algorithm II with A given 
by ( fT7l >. By ( fTTI l. for example, ( |29| i simply says 1 + g + = A. 
Hence Corollary [2] recommends setting A at its upper bound 
in ©. In view of ( fTTI l, it is not surprising that in Theorem 
[3] and Corollary [2] the vectors / and g enter the picture only 
through f + and g + . 

VI. Summary and discussion 

A simple "squeezing" strategy is studied for speeding up the 
Arimoto-Blahut algorithm for discrete memoryless channels. 
This strategy introduces auxiliary vectors r and / and refor- 
mulates the problem so as to reduce the overlap between rows 
of the channel matrix W. A desirable feature of the resulting 
Algorithm II/III is that it improves ABA without sacrificing 
its simplicity or monotonic convergence properties. 

The effectiveness of Algorithm II/III is limited by the 
availability of large values of r and /. If the constraint (fTob 
forces both r and / to be close to zero, then we can expect 
little improvement from Algorithm II/III. Simply put, some 
channel matrices are not very "squeezable." Nevertheless, 
modifications can conceivably be designed for such situations. 
For example, suppose the input alphabet is ordered so that the 
overlap between conditional distributions Wi is most severe 
between adjacent i's. Then a natural strategy is to apply Algo- 
rithm II to update the probabilities for one neighborhood of i's 
at a time, holding the remaining components fixed. Potential 
applications, e.g., to the discrete-time Poisson channel (|14|, 
[9|), will be reported in future works. 

An open problem is to determine the optimal squeezing 
parameters, i.e., the values of r and / that produce the fastest 
Algorithm III. While the results in Section V paint a general 
picture, further theoretical studies may lead to extensions and 
refinements. If the optimal choice is difficult to derive or to 
implement, empirical studies may suggest effective rules. 
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Appendix 
A: Waterfilling for © 
We need to determine 8 = 8^ such that 

y^max-fa, Sxi} = 1, 



where SEj = p^f 1 exp ^Az,|*^ as in ©. This is feasible with 
8 > because J^. < 1. 
Step 1. Sort Ti/xi, say 

n<n<...<z™. 

X\ x 2 x m 

Step 2. Calculate the cumulative sums r* = Y^jLi r i anc ' 
Xi* = J2]=i x ji « = 1, • ■ ■ , to. By convention rj^ +1 = xo* = 
0. 

Step 3. Locate the largest index i 6 {1, . . . , m} such that 

j * . 

Xi* + ^ 1. 

Xi 

Set 8 = (1 -r* +1 )/x„. 

The overall time cost is 0(m log to) due to Step 1. 

B: Proof of Theorem[T] monotonic convergence 

Algorithm III is seen as an alternating divergence minimiza- 
tion procedure between convex sets of measures (151. (4)). Let 
X = {0, 1, . . . , to} and Y = {1, . . . , n}. Let V be the set of 
measures on X x Y of the form P = (Pij), 



P — 



1 < i < m 
i = 



where p — (p%, . . . ,p m ) £ f2(r). Let Q be the set of measures 
on X x Y of the form Q — (Qij), 



$jiWije Cz , 1 < i < m 



i = 



where $ji > and £™ $^ = 1. Observe that (i) both V 
and Q are convex; (ii) I(p, $) = —D(P\\Q); and (iii) (fjojl and 
(f2"TT i correspond to minimizing D(P\\Q) over Q for fixed P, 
and over P for fixed Q, respectively. The claim then follows 
from Theorem 3 of Csiszar and Tusnady 0. 

C: Proof of Theorem |2j convergence rate 

With a slight abuse of notation let &ji(p) and Pi($) be 
functions given by (f20b and ( f22l respectively. Then (1 < i, k < 

to, 1 < j < ri) 



dpk 



$i<(p)(l-*j<(p))Pi\ k = ^ 



We calculate R(p*) as 



k = i; 



(31) 



(32) 



d P (Hp)) 



dp 



p=p* 



Write $* = Then p($*) = p*. These relations and 

(fJTT l and (l32l are used repeatedly. 
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For i 7^ k, 1 < i, k < m, we have 



dPi($(p*)) 
dpk 



E 



''•I.,, 



dp k d^ jk dp k 

E# 



d$ji dp k 



"k 



Pi 



+E E 

- E [Q--Pi)Wkj**ji +Ptw kj (i - 

(33) 

+ ^ P *(l-$; _$* i _$* it )Wiy 



(34) 



where (f33T > uses (|20t , 

For 1 < k < m, a similar calculation yields 

dp fe ($(p*)) 



l-E^(*i*+Wo)- (35) 



Alternatively, d35l l can be derived from (f34T > and 

The identity d23l is just (l34l > and (l35l l in matrix format. 



(36) 



D: Convergence rates: properties and comparisons 

This section proves Propositions [2] and [3] and Theorems [3] 
andH] The notation is the same as in Section V. 

Part 1 of Proposition |2] follows from (f36b . For further 
analysis, define 

l " ~ lmr W, s = p*W*, D s = Diag(s). 



W* 



That is, D s is the diagonal matrix with s as the diagonal 



entries. Also let D v 
obtain 



Diag(p*). From 



and (O, 



Thus ( l23l can be written as 

R{P*) - Im - (1 + /+)# + £ 

where 

L = lmfD^W^Dp 



(37) 



A' = W*D~ 1 W* T D V 



Observe that D)!, 2 KD. 



1-1/2 



is symmetric and nonnegative 



definite. Thus K is diagonalizable and has only nonnegative 
eigenvalues. When / = 0, we have R(p*) = I m — K. Thus 
R(p*) is diagonalizable in this case. This proves Proposition 


Define a space of row vectors T = {7 G R m : jl m = 0}. 
For an m x m matrix A such that jA G T whenever 7 G T, we 



write So(A) as the spectral radius of A when restricted as a 
linear transformation on V. Suppose A satisfies Al m = 0, and 
suppose d is a nonzero eigenvalue of A, with a corresponding 
left eigenvector 7. Then 



= jAl m = djl r 



7 g r. 



Hence the set of nonzero eigenvalues is unchanged when A is 
restricted to V. In particular, 



S(A)=S (A). 



(38) 



We have 7L = for any 7 G T. Thus R(p*) and I m - (1 + 
f+)K represent the same linear transformation when restricted 
to T. Also, i?(p*)l m = by Proposition Q] By the preceding 
discussion, if d is a nonzero eigenvalue of R(p*), then <i is an 
eigenvalue of I m — (1 + f+)K. Equivalently, (1 — d)/(l + /+) 
is an eigenvalue of A. We know d < 1 because A only has 
nonnegative eigenvalues. On the other hand, because 1 — d 
is an eigenvalue of the stochastic matrix W^, the Frobenius- 
Perron theorem implies that |1 — d\ < 1, i.e., d > 0. This 
proves Proposition |3 

We also have 



R(r,f) 



S (R(p*)) (39) 
= S (I m - (1 + f+)K) 

= {l + f+)S (I m -K)-f + (40) 

= (1 + f+)S(I m - A) - /+ (41) 

= (l + f+)R(r,0)-f+. (42) 

Identity (|39l follows from d38l . Identity (l40l > holds because, 
by Proposition [3] the spectral radii involved refer to the largest 
eigenvalues. Because (I m —K)l m — 0, we have (HTt . Identity 
(l42l > holds because An — A is precisely the matrix rate of 
Algorithm III that uses (r, 0) in place of (r, /). Thus we have 
proved Theorem [3] 

Proof of Theorem [?} By (|28T > and Theorem [3] we have 

1 - R(r, g) = l- R(r, g - (1 + g+)rW) 

= (l+. 9+ )(l-r+)(l-A(r,0)). 

Thus, to prove R(r,g) < R(r,g), we only need 

(1 - r+)(l - R(r, 0)) > (1 - r+)(l - R(r, 0)). 

Let us only consider r = 0, i.e., 

(1 -r+)(l --R(r,0)) > 1 -R(0,0). (43) 

The general case reduces to this special one (details omitted) 
if we replace W by 

An. l^^Trr 

-W, 



1-n 



and r by r — (1 — r+)f/ (1 — f+). 
By d37] >, we have 



A(r,0) = S(/ m -[/AC/ T .D p ,) 



where 



I rn 1 r 



1 



F = WDZ 1 W T , 



p*UW 



--pW, 



(1 - r + )p + r, 
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and p denotes the (same) final output of Algorithm III using 
(r,0) or (0,0) for (rj). Define 



A = FU ' D p * . 



(44) 



The same argument leading to Proposition [3] and Theorem [3] 
shows that all eigenvalues of A are in the interval [0,1], and 



S(I m -A) = {1- r+)R{r, 0) + r+. (45) 



Define 



C = F^pF 1 ' 2 ; 

A = F 1/2 U T D r F 1/2 (46) 

= F 1 ' 2 (d p + D [Zl TV ) Fl/2 ~ C - 

Comparing d46l > with d44b shows that A and A have the same 
set of eigenvalues. Let a be the smallest eigenvalue of A, 
and let j3 be a corresponding right eigenvector. Then a = 
1 - S(I m - i), and by g5), 

a = 1 - S(I m - A) = (1 - r+)(l - fl(r,0)). (47) 

By direct calculation, we have 

aCf3 = CAP = [(1 - r+)C + F l ' 2 r T rF^ 2 ]/3. (48) 

If a = 1 - r+, then gHJ gives F 1 / 2 ^ rF 1 ' 2 (3 = 0, which 
implies 

f3 T F 1/2 r T rF 1/2 /3 = 0; rF 1/2 /3 = 0; /3 T C* = 0. 
Thus, 

a(3 T p = p T Ap (49) 
= f3 T F 1 / 2 (d p + F x l 2 f5 

> (3 T F 1/2 DpF 1/2 /3 

> (l-i?(0,0))/3 T /3, (50) 
where (f50T > follows from 

i?(0, 0) = S(I m - FDp) = S(I m - F 1 ' 2 D p F 1 ' 2 ). 

We deduce a > 1 — i?(0, 0) and conclude the proof of (|43"T >. If 
a/l-r + , then (|47l i implies a + r + — 1 < 0, and (|48l > leads 
to 

pT Cj3 _ P T F 1/2 r T rFV*0 
a + r + - 1 

Calculations similar to (f49b— (|50b yield the same conclusion, 
i.e., a > 1 - R(0, 0). ■ 
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